Dataset

White Wine Quality provided by Udacity

Description

This tidy data set contains 4,898 white wines with 11 (major) variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (excellent).

Guiding Question

Which chemical properties influence the quality of white wines?

Introduction

This dataset contains information about white wine quality. The wines have been rated by experts. I have limited knowledge of wines, but it would be really interesting to see how the ingredients relate to the quality of white wines. Through this exploration, I would like to gain some insight into which chemicals or specific ingredients make a wine taste better. This exploratory analysis could be useful to wine makers.

Exploring the dataset

We start by looking at the size and structure of the dataset:

## [1] "No of data points: 4898"
## [1] "No of features: 13"
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
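The output above could come from calls along the following lines. This is only a sketch: the report does not show its code, so the data frame here is rebuilt from the first 10 rows of a few columns in the `str()` output, and the name `wines_head` and the file name in the final comment are assumptions.

```r
# First 10 rows of a few columns, copied from the str() output above.
# `wines_head` is a hypothetical name for this small illustrative sample.
wines_head <- data.frame(
  X              = 1:10,
  fixed.acidity  = c(7, 6.3, 8.1, 7.2, 7.2, 8.1, 6.2, 7, 6.3, 8.1),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  quality        = rep(6L, 10)
)

# Size, column names and structure, in the same form as the report's output.
print(paste("No of data points:", nrow(wines_head)))
print(paste("No of features:", ncol(wines_head)))
print(names(wines_head))
str(wines_head)

# On the full data the same calls would follow a read such as
# wines <- read.csv("wineQualityWhites.csv")  # file name is an assumption
```

On the full file, the same four calls should report 4898 rows and 13 columns, matching the output shown.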

The output lists the various chemical properties (ingredients) of white wines. Now that we have seen the variables, I would like to plot them one by one to see what insights they can give.

Univariate Plots Section

Here I explore the univariate plots. As part of the univariate analysis, I would like to explore the various features of the wine dataset, see their patterns, and find something insightful.

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

Observations from the Summary:

  • Sulfur dioxide (both free and total) spans a wide range across the samples: 2 to 289 for free and 9 to 440 for total.
  • The alcohol content varies from 8.00 to 14.20.
  • The quality of the samples ranges from 3 to 9, with a median of 6.
  • Fixed acidity also has a wide range, from a minimum of 3.8 to a maximum of 14.2.
  • pH varies from 2.720 to 3.820, with a median of 3.180.

I suspect that the density of a wine and its alcohol content are the most important properties. Before exploring them, I would like to look at the quality rating given to each wine by the experts and its distribution.

The bar chart shows that most of the wines were given a rating in the range 5 to 7, with 6 the most common. The distribution of quality is roughly bell-shaped.

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
## [1] 5.877909

As expected, the average quality rating is about 5.88.
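The counts in the table are enough to reproduce the mean and median. As a small sketch (the variable names are my own, and the `quality` vector is rebuilt from the printed counts rather than read from the file):

```r
# Counts per quality rating, copied from the table above.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)

# Rebuild one rating per wine from the counts.
quality <- rep(as.integer(names(counts)), times = counts)

length(quality)   # 4898 wines in total
table(quality)    # same counts as the table above
mean(quality)     # 5.877909, as printed above
median(quality)   # 6

# Share of wines rated 5, 6 or 7 (roughly 93%).
share_5_to_7 <- sum(counts[c("5", "6", "7")]) / sum(counts)
share_5_to_7
```

Passing `table(quality)` to `barplot()` would draw a bar chart of the same distribution.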

The above box plot shows that most of the wines have a density of around 0.994.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

To ground this further, the summary stats show that the median density is 0.9937, the mean density is 0.9940, and the inter-quartile range (between the 1st and 3rd quartiles) runs from 0.9917 to 0.9961.
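The same summary can also be used to check for outliers. The sketch below applies the conventional 1.5 × IQR boxplot rule to the rounded quartiles printed above. The variable names are my own, and the exact whisker positions in the plot may differ slightly because of rounding.

```r
# Rounded values copied from the density summary above.
density_min <- 0.9871
density_q1  <- 0.9917
density_q3  <- 0.9961
density_max <- 1.0390

# Inter-quartile range and the usual 1.5 * IQR fences.
density_iqr <- density_q3 - density_q1
lower_fence <- density_q1 - 1.5 * density_iqr
upper_fence <- density_q3 + 1.5 * density_iqr
c(iqr = density_iqr, lower = lower_fence, upper = upper_fence)

# Is the extreme value on each side beyond its fence?
density_min < lower_fence   # FALSE: no outliers on the low side
density_max > upper_fence   # TRUE: at least one outlier on the high side
```

So under this rule the maximum density of 1.0390 counts as an outlier, while the minimum does not.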

Next I want to look at the alcohol content of white wines. No outliers are visible in this distribution. Most of the wines have an alcohol content between about 9.5 and 11.4 (the inter-quartile range in the summary above), which is fairly wide. It will be interesting to see how alcohol varies with quality or density in the bivariate analysis.

As far as I know, wine experts judge wines with their senses: sight, smell and taste. Different chemical components account for different sensations. For example, residual sugar adds sweetness, citric acid is associated with freshness, and acids or tannins give an astringent taste. So I am interested in citric acid, residual sugar, and fixed acidity.

First, I would like to explore the fixed acidity feature of the dataset.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

The fixed acidity histogram also looks roughly normal. Most white wines have a fixed acidity of 6 to 7 g/dm^3.

I would now explore the citric acid content of the wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

The citric acid distribution looks roughly normal. Most white wines have about 0.3 g/dm^3 of citric acid. There is an interesting peak near 0.5 g/dm^3; I wonder what causes it.

I would now explore the residual sugar content of the wines.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

The residual sugar distribution is right-skewed. The largest spike is at 1 to 2 g/dm^3. This distribution shows that very sweet wines are rare.

Next I want to see the distributions of volatile acidity (to see how it differs from fixed acidity), pH (which tells whether the wine is acidic or basic overall), and chlorides (to see the salt content).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

Volatile acidity looks roughly normally distributed. Most white wines have a volatile acidity between 0.2 and 0.3 (the median is 0.26).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

pH is also roughly normally distributed. The median pH of the wines in the dataset is 3.18.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Chlorides are also roughly normally distributed up to a content of 0.1, but a small number of data points lie above 0.1 and form a long tail. Most of the wines have a chloride content of about 0.045.

Finally as part of the univariate plots, I want to see the distribution of sulphate values and sulphur dioxide content.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800

Sulphates look roughly normally distributed. Most white wines have a sulphates value close to 0.5 (the median is 0.47).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

Free sulfur dioxide looks roughly normally distributed. Most white wines have a free sulfur dioxide value around 34, the median.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

Total sulfur dioxide looks roughly normally distributed. Most white wines have a total sulfur dioxide value around 134, the median.

Univariate Analysis

What is the structure of your dataset?

There are 4898 observations and 13 features: an index column (X), 11 chemical input variables, and one output variable, the wine quality. Quality is an integer variable with a minimum of 3, a maximum of 9, a median of 6, and a mean of 5.878.

All the chemical property variables are floating-point numbers. They are measured in different units and therefore lie in widely different ranges. For example, the chlorides variable has a small range from 0.009 to 0.346, while the total.sulfur.dioxide variable has a large range from 9.0 to 440.0.

What is/are the main feature(s) of interest in your dataset?

The main features of interest in the data set are alcohol and quality. I suspect that alcohol, together with some combination of other variables, can be used to build a predictive model of wine quality. I would like to explore these two variables in the bivariate analysis.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Features such as residual sugar, sulphates, pH, and chlorides will likely contribute to wine quality and will support the investigation.

Did you create any new variables from existing variables in the dataset?

No. So far I haven't created any new variables, as the data set already seems to be tidy.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Chloride content distribution

During the investigation, I found that the chlorides variable has an unusual distribution. The histogram of chlorides shows that the majority of samples lie in the range [0, 0.1] in a roughly normal shape, but a small number of outliers lie far beyond this range (up to 0.346), which indicates a long-tailed distribution.

In order to better visualize this distribution, I cut off the samples beyond 0.1 and “zoom in” on those in the “regular range”. All three plots individually show a roughly normal distribution.
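A cut-off like this is a one-line subset in R. The sketch below is illustrative only: `zoom_in` is a hypothetical helper, and the example vector reuses the five summary values of chlorides printed earlier rather than the full column.

```r
# Hypothetical helper: keep only values at or below the cut-off.
zoom_in <- function(x, cutoff = 0.1) {
  x[x <= cutoff]
}

# Stand-in data: min, 1st quartile, median, 3rd quartile and max of chlorides
# from the summary above (not the full column).
chlorides_example <- c(0.009, 0.036, 0.043, 0.050, 0.346)

regular_range <- zoom_in(chlorides_example)
regular_range           # the maximum, 0.346, is dropped
length(regular_range)   # 4 of the 5 values remain
```

On the full data, assuming the data frame is called `wines`, the zoomed histogram could then be drawn with `hist(zoom_in(wines$chlorides))`.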

Now that I have explored some individual variables, I would like to know their relationships with each other. We start with the Bivariate plot section next.

Bivariate Plots Section

I wish to know if there is any correlation between various features.

Pearson’s Correlation

##                                 X fixed.acidity volatile.acidity
## X                     1.000000000   -0.25581431      0.002857966
## fixed.acidity        -0.255814305    1.00000000     -0.022697290
## volatile.acidity      0.002857966   -0.02269729      1.000000000
## citric.acid          -0.149899918    0.28918070     -0.149471811
## residual.sugar        0.006623775    0.08902070      0.064286060
## chlorides            -0.045645192    0.02308564      0.070511571
## free.sulfur.dioxide  -0.011928911   -0.04939586     -0.097011939
## total.sulfur.dioxide -0.161979037    0.09106976      0.089260504
## density              -0.185976097    0.26533101      0.027113845
## pH                   -0.115774132   -0.42585829     -0.031915368
## sulphates             0.009807759   -0.01714299     -0.035728147
## quality               0.035763247   -0.11366283     -0.194722969
##                       citric.acid residual.sugar   chlorides
## X                    -0.149899918    0.006623775 -0.04564519
## fixed.acidity         0.289180698    0.089020701  0.02308564
## volatile.acidity     -0.149471811    0.064286060  0.07051157
## citric.acid           1.000000000    0.094211624  0.11436445
## residual.sugar        0.094211624    1.000000000  0.08868454
## chlorides             0.114364448    0.088684536  1.00000000
## free.sulfur.dioxide   0.094077221    0.299098354  0.10139235
## total.sulfur.dioxide  0.121130798    0.401439311  0.19891030
## density               0.149502571    0.838966455  0.25721132
## pH                   -0.163748211   -0.194133454 -0.09043946
## sulphates             0.062330940   -0.026664366  0.01676288
## quality              -0.009209091   -0.097576829 -0.20993441
##                      free.sulfur.dioxide total.sulfur.dioxide     density
## X                          -0.0119289106         -0.161979037 -0.18597610
## fixed.acidity              -0.0493958591          0.091069756  0.26533101
## volatile.acidity           -0.0970119393          0.089260504  0.02711385
## citric.acid                 0.0940772210          0.121130798  0.14950257
## residual.sugar              0.2990983537          0.401439311  0.83896645
## chlorides                   0.1013923521          0.198910300  0.25721132
## free.sulfur.dioxide         1.0000000000          0.615500965  0.29421041
## total.sulfur.dioxide        0.6155009650          1.000000000  0.52988132
## density                     0.2942104109          0.529881324  1.00000000
## pH                         -0.0006177961          0.002320972 -0.09359149
## sulphates                   0.0592172458          0.134562367  0.07449315
## quality                     0.0081580671         -0.174737218 -0.30712331
##                                 pH    sulphates      quality
## X                    -0.1157741316  0.009807759  0.035763247
## fixed.acidity        -0.4258582910 -0.017142985 -0.113662831
## volatile.acidity     -0.0319153683 -0.035728147 -0.194722969
## citric.acid          -0.1637482114  0.062330940 -0.009209091
## residual.sugar       -0.1941334540 -0.026664366 -0.097576829
## chlorides            -0.0904394560  0.016762884 -0.209934411
## free.sulfur.dioxide  -0.0006177961  0.059217246  0.008158067
## total.sulfur.dioxide  0.0023209718  0.134562367 -0.174737218
## density              -0.0935914935  0.074493149 -0.307123313
## pH                    1.0000000000  0.155951497  0.099427246
## sulphates             0.1559514973  1.000000000  0.053677877
## quality               0.0994272457  0.053677877  1.000000000

High correlations (≥ 40% in absolute value) are identified and marked in red. Pairwise scatterplots are also shown below.
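To make the 0.4 threshold concrete, the sketch below picks out the strongly correlated pairs programmatically. It is only an illustration: rather than recomputing the matrix from the raw data, it rebuilds a sub-matrix for six variables from the coefficients printed above (rounded to four decimals), and the object names are my own.

```r
vars <- c("fixed.acidity", "residual.sugar", "free.sulfur.dioxide",
          "total.sulfur.dioxide", "density", "pH")

# Start from the identity (each variable correlates 1 with itself).
r <- diag(length(vars))
dimnames(r) <- list(vars, vars)

# Upper triangle, filled column by column with values from the output above.
r[upper.tri(r)] <- c(
   0.0890,                                    # residual.sugar column
  -0.0494,  0.2991,                           # free.sulfur.dioxide column
   0.0911,  0.4014,  0.6155,                  # total.sulfur.dioxide column
   0.2653,  0.8390,  0.2942,  0.5299,         # density column
  -0.4259, -0.1941, -0.0006,  0.0023, -0.0936 # pH column
)
# Mirror the upper triangle into the lower triangle.
r[lower.tri(r)] <- t(r)[lower.tri(r)]

# Pairs with |r| >= 0.4, each pair listed once.
idx <- which(abs(r) >= 0.4 & upper.tri(r), arr.ind = TRUE)
strong <- data.frame(var1 = vars[idx[, "row"]],
                     var2 = vars[idx[, "col"]],
                     r    = r[idx])
strong
```

This lists five pairs, led by residual sugar and density at about 0.84. On the full data the matrix itself would come from a call such as `cor(wines)`, where `wines` is an assumed name for the data frame.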

Scatterplot of Predictors

Relationship between Total Sulphur Dioxide and Quality

Higher quality wines seem to have lower levels of total sulphur dioxide, as the median value falls as quality increases (Pearson's r is about -0.17 in the matrix above). The highest-rated wines have the lowest total sulphur dioxide content.

Relationship between Alcohol and Quality

Higher quality wines seem to have higher levels of alcohol, as the median value rises consistently with quality. The highest-rated wines have the highest alcohol content.

Relationship between Density and Residual Sugar

Density and residual sugar are strongly positively correlated (r is about 0.84 in the matrix above). This makes sense: dissolved sugar adds mass to roughly the same volume of liquid, and since density = mass/volume, density rises with residual sugar.

Relationship between Quality and Residual Sugar/Citric Acid

The ratio of residual sugar to citric acid seems to play a large role in quality. This can be explained by the fact that good quality wines are crisp and dry. See the link in the references for further explanation.
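The ratio can be computed directly from the two columns. The sketch below uses the first 10 rows shown in the `str()` output earlier; `sugar_acid_ratio` is a hypothetical helper. Because the summary shows a minimum citric acid of 0, a plain division would give `Inf` for some wines, so the helper returns `NA` in that case. That handling is my assumption, not necessarily what the original analysis did.

```r
# Hypothetical helper: residual sugar divided by citric acid,
# with NA where citric acid is 0 (plain division would give Inf).
sugar_acid_ratio <- function(residual.sugar, citric.acid) {
  ratio <- residual.sugar / citric.acid
  ratio[citric.acid == 0] <- NA
  ratio
}

# First 10 rows of both columns, copied from the str() output.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
citric.acid    <- c(0.36, 0.34, 0.4, 0.32, 0.32, 0.4, 0.16, 0.36, 0.34, 0.43)

ratio <- sugar_acid_ratio(residual.sugar, citric.acid)
round(ratio, 2)

# A wine with no citric acid gets NA rather than Inf.
sugar_acid_ratio(5.2, 0)
```

A lower ratio means less sugar relative to citric acid, which is the direction the text associates with crisp, dry wines.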

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I looked at the relationships between (quality of wine and the residual sugar to citric acid ratio), (quality of wine and total sulphur dioxide) and (quality of wine and alcohol). The quality of wine seemed to be positively correlated with alcohol content, while there was a negative correlation between quality and total sulphur dioxide. The ratio of residual sugar to citric acid seems to play an important role, because it directly affects the crispness/dryness of wines. Good wines tend to be crisp and dry, so they have high acidity and lower sugar levels. You can check the references for more information about the ‘crispness’ of wines.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

All plots were as expected, so nothing was extraordinary. The relationship between density and residual sugar was quite straightforward: density is directly proportional to mass, and higher levels of sugar increase the mass.

What was the strongest relationship you found?

Quality of wine and alcohol seem to be strongly correlated: the higher the alcohol content, the higher the quality of the wine.

Multivariate Plots Section

Next we will explore the interaction between multiple variables.

Relationship between Citric Acid, alcohol and pH value

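The pH indicates whether a wine is acidic or alkaline; every wine in this dataset is acidic, with a pH between 2.72 and 3.82. In the plot, pH seems to increase with alcohol content at a given level of citric acid. Acidity matters here because it contributes to how crisp/dry a wine tastes.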

Relationship between Quality and Residual Sugar/Citric Acid

It can be seen clearly that high quality wines tend to be less sweet and more crisp. This is due to higher levels of citric acid and less sugar, which make the wine drier. Also, alcohol level is positively correlated with the quality of wine.

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

I looked at the relationship between acidity (pH), citric acid and alcohol. I found that pH seemed to increase with alcohol when the quantity of citric acid is held fixed.

Were there any interesting or surprising interactions between features?

Yes. I found that good quality wines seemed to have lower levels of sugar and higher levels of alcohol. Good quality wines also seem to have a good ratio of citric acid to residual sugar, which keeps the wine crisp and dry.

Final Plots and Summary

First Plot

Description of first plot

This plot suggests that the majority of wines have a rating of 5, 6 or 7. The distribution is also roughly normal.

Second Plot

Description of second plot

Higher quality wines seem to have higher levels of alcohol, as the median value rises consistently with quality. The highest-rated wines have the highest alcohol content.

Third Plot

Description of third plot

Sulphur dioxide is used as a preserving agent in wine. However, its presence produces a pungent aroma, which is undesirable in wines. Higher quality wines seem to have lower levels of total sulphur dioxide, as the median value falls consistently as quality increases. The highest-rated wines have the least total sulphur dioxide. Check the references for more information on sulphur dioxide.

Reflection

The dataset was quite large and interesting. After performing the analysis, I learnt a great deal about wines and found several factors that affect wine quality. These include alcohol, sulphur dioxide, residual sugar and citric acid. The higher the level of alcohol, the better the wine. The opposite is true for sulphur dioxide, which negatively affects the quality of wine. Good quality wines seem to maintain a good ratio of citric acid to sugar, which keeps the wine crisp and dry.

Initially I had great trouble understanding the different factors. I had to study the fermentation process to understand them better. I believe that with more knowledge of chemistry I could have improved my analysis. Some feature engineering would definitely have helped as well. In future analysis, a machine learning model could also be used to predict quality, which would help to understand the relationship between wine quality and the various factors.